We present a mutual exclusion algorithm that performs well both with and without contention, on
machines with no atomic instructions other than read and write. The algorithm capitalizes on the ability
of memory systems to read and write at both full- and half-word granularities. It depends on predictable
processor execution rates, but requires no bound on the length of critical sections, performs only O(n)
total references to shared memory when arbitrating among conflicting requests (rather than O(n²) in the
general version of Lamport's fast mutual exclusion algorithm), and performs only 2 reads and 4 writes
(a new lower bound) in the absence of contention. We provide a correctness proof.

We  also  investigate  the  utility  of  exponential  backoff  in  fast  mutual  exclusion,  with  experimental 
results  on  the  Silicon  Graphics  Iris  multiprocessor  and  on  a  larger,  simulated  machine.  With  backoff  in 
place, we find that Lamport's algorithm, our new algorithm, and a recent algorithm due to Alur and
Taubenfeld  all  work  extremely  well,  outperforming  the  native  hardware  locks  of  the  Silicon  Graphics 
machine,  even  with  heavy  contention. 


1  Introduction 

Many researchers have addressed the problem of n-process mutual exclusion under a shared-memory
programming model in which reads and writes are the only atomic operations. Early solutions to the problem
entail a lock acquisition/release protocol in which each process that wishes to execute the critical section
makes Ω(n) references to shared memory, where n is the total number of processes [3, 7].

On  the  assumption  that  contention  is  relatively  rare,  Lamport  in  1987  suggested  two  mutual  exclusion 
algorithms  [4]  in  which  a  process  performs  only  a  constant  number  of  shared  memory  references  in  its 
acquisition/release  protocol,  so  long  as  no  other  process  attempts  to  do  so  simultaneously.  The  first  algorithm 
requires  a  bound  on  the  length  of  a  critical  section  (which  is  not  always  possible),  and  a  bound  on  the  relative 
rates of process execution. The second algorithm performs Ω(n²) total references to shared memory (Ω(n)
in each process) when attempting to arbitrate among n concurrent lock acquisition attempts (see section 2).

We have developed an algorithm that retains the O(1) bound of Lamport's algorithms in the absence of
contention, while arranging to elect a winner after only O(n) shared memory references in the presence of
contention. Like Lamport's first algorithm, the new algorithm requires a bound on relative rates of process
execution; it does not, however, require a bound on the length of a critical section.

Several researchers have recently presented algorithms with similar characteristics. These are summarized
in table 1. The first column of the table indicates the number of references a process makes to shared memory
when acquiring and releasing a lock for which there is no contention. The second column indicates the number
of references that may need to execute sequentially in order for some process to enter its critical section when
n processes wish to do so. (This notion of “time” differs from that of most other researchers; we assume that
references may serialize if they are made by the same process or require the use of the same memory bank or
communication link.) Our algorithm performs fewer shared memory references than any but the
bounded-critical-section version of Lamport's algorithm. It is also substantially simpler than the algorithms of Styer
or Yang and Anderson, both of which employ a hierarchical collection of sub-n-process locks. There is a
strong resemblance between our algorithm and that of Alur and Taubenfeld, though the two were developed
independently. In effect, we reduce the number of shared-memory operations by exploiting the ability of
most memory systems to read and write atomically at both full- and half-word granularities.

                          Shared memory references to choose winner             Needs speed
Algorithm                 No contention              Contention                 bound?  Comments
Lamport 1 [4]             2 reads, 3 writes          O(n)                       yes     requires bound on critical section length
Lamport 2 [4]             2 reads, 5 writes          Θ(n²)                      no
Styer [8]                 3l reads, (4l + 1) writes  Ω(l n^(1/l))               no      l can be chosen anywhere in (O(1), O(log n))
Yang and Anderson 1 [10]  O(log n)                   O(log n)                   no      starvation free, no remote spins
Yang and Anderson 2 [10]  6 reads, 9 writes (a)      O(n) [O(log n) “typical”]  no      starvation free, no remote spins
Alur and Taubenfeld [1]   3 reads, 5 writes          O(n)                       yes
new                       2 reads, 4 writes          O(n)                       yes     requires multi-grain atomic reads and writes

Table 1: Comparative characteristics of fast mutual exclusion algorithms.

(a) With appropriate assignment of variables to local memory locations, 5 of the 9 writes need not traverse the
processor-memory interconnection network.

* This work was supported in part by NSF Institutional Infrastructure award number CDA-8832734, NSF grant number
CCR-9005633, and ONR research contract number N00014-92-J-1801 (in conjunction with the ARPA Research in Information
Science and Technology - High Performance Computing, Software Science and Technology program, ARPA Order No. 8930).

Following  the  presentation  of  our  algorithm  in  section  2,  we  present  a  correctness  proof  in  section  3, 
experimental  performance  results  in  section  4,  and  conclusions  in  section  5.  In  our  experiments,  we  employ 
limited  exponential  backoff  to  reduce  the  amount  of  contention  caused  by  concurrent  attempts  to  acquire 
a lock. This technique, originally suggested by T. Anderson, works very well for test_and_set locks [2, 5],
and our results show it to be equally effective for locks based on reads and writes. In fact, on our Silicon
Graphics multiprocessor, fast mutual exclusion algorithms with backoff (and the new algorithm in particular)
outperform  the  native  hardware  spin  locks  by  a  significant  margin,  with  or  without  contention.  Results  on 
a  larger,  simulated  machine  also  show  the  new  algorithm  outperforming  both  Lamport's  second  algorithm 
and  Alur  and  Taubenfeld's  algorithm. 


2  Algorithms 

Lamport [4] presents two mutual exclusion algorithms. Both allow a process to enter its critical section in
constant time. The first algorithm requires a bound on the relative rates of execution of different processes,
and on the time required to execute critical sections. In the absence of contention a process requires five
accesses to shared memory to acquire and release the lock. Process i executes the code on the left side of
figure 1. Variable Y is initialized to free, and the delay in line 7 is assumed to be long enough for any
process that has already read Y = free in line 3 to complete lines 5, 6, and (if appropriate) 10 and 11.

The second algorithm does not require any bounds on execution rates or lengths of critical sections. In
the absence of contention a process requires seven accesses to shared memory to acquire and release the lock,
and O(n²) time with contention.¹ Process i executes the code on the right side of figure 1. Variable Y is
initialized to free and each element of the B array is initialized to false.

     1: START:                            1: START:
     2:   X ← i                           2:   B[i] ← true
     3:   if Y ≠ free                     3:   X ← i
     4:     goto START                    4:   if Y ≠ free
     5:   Y ← i                           5:     B[i] ← false
     6:   if X ≠ i                        6:     repeat until Y = free
     7:     { delay }                     7:     goto START
     8:     if Y ≠ i                      8:   Y ← i
     9:       goto START                  9:   if X ≠ i
    10:   { critical section }           10:     B[i] ← false
    11:   Y ← free                       11:     for j ← 1 to N
    12:   { non-critical section }       12:       repeat while B[j]
    13:   goto START                     13:     if Y ≠ i
                                         14:       repeat until Y = free
                                         15:       goto START
                                         16:   { critical section }
                                         17:   Y ← free
                                         18:   B[i] ← false
                                         19:   { non-critical section }
                                         20:   goto START

Figure 1: Lamport's fast mutual exclusion algorithms.

We have devised a new mutual exclusion algorithm that allows a process to enter its critical section with
only six shared memory references in the absence of contention. In the presence of contention, it requires
O(n) time. As in Lamport's first algorithm, we assume a bound on relative rates of process execution. Such
an assumption is permissible if the algorithm is executed by an embedded system, or by an operating system
routine that executes with hardware interrupts disabled. We do not, however, require a bound on the length
of critical sections. Process i in our algorithm executes the code in figure 2. Variables Y and F are initialized
to free and out, respectively. They are assumed to occupy adjacent half-words in memory, where they can
be read or written either separately or together, atomically. The delay in line 7 is assumed to be long enough
for any process that has already read Y = free in line 3 to complete line 5, and any process that has already
set Y in line 5 to complete line 6 and (if not delayed) line 10.
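
The adjacent half-word layout can be pictured with a small Python sketch. It only illustrates the memory layout: the names Halves, LockWord, and release, the 16-bit field width, and the encodings FREE = 0, OUT = 0, IN = 1 are assumptions made for this example, and Python itself gives no atomicity guarantee for these accesses.

```python
import ctypes

# Assumed encodings (not from the paper): choosing 0 for both "free" and "out"
# makes the pair (free, out) the all-zero full word.
FREE, OUT, IN = 0, 0, 1


class Halves(ctypes.Structure):
    # Two adjacent 16-bit half-words.
    _fields_ = [("Y", ctypes.c_uint16),   # id of the last process to claim the lock, or FREE
                ("F", ctypes.c_uint16)]   # IN while a process occupies its critical section


class LockWord(ctypes.Union):
    # The same four bytes viewed either as two half-words or as one full word.
    _fields_ = [("halves", Halves),
                ("full", ctypes.c_uint32)]


def release(word):
    """Model the write of (free, out) to both halves as a single full-word store."""
    word.full = 0
```

With this layout, writing Y or F alone touches one half-word, while the release touches both halves in one store of the full word.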

A similar algorithm, due to Alur and Taubenfeld [1], appears in figure 3. Rather than read and write
at multiple granularities, this algorithm relies on an additional flag variable (Z) to determine whether any
process has entered the critical section by the end of the delay. When releasing the lock, process i first clears
Z, and then clears Y only if Y still equals i. If Y has changed, the last process to change it is permitted to
enter the critical section as soon as Z is cleared. Both our algorithm and Alur and Taubenfeld's assume a
bound on relative rates of process execution, with identical delays on the slow code path, when contention
is detected. Both algorithms require only O(n) time when arbitrating among n concurrent lock acquisitions,
and O(1) time in the absence of contention. On the fast code path, however, our algorithm performs 25%
fewer shared memory references. As shown in section 4, this translates not only into lower overhead in the
no-contention case, but also, given backoff, in most cases of contention as well.
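
The fast-path reference counts can be checked mechanically. The following Python sketch runs the uncontended acquire/release sequence of each algorithm against an instrumented memory and tallies reads and writes. The CountingMemory helper, the cell name Bi (standing for B[i]), and the integer encodings are assumptions of this sketch; a write that names two cells models a single full-word store.

```python
FREE, OUT, IN = 0, 0, 1   # assumed encodings for this sketch


class CountingMemory:
    """Shared memory that tallies every read and write (illustrative helper)."""

    def __init__(self, **cells):
        self.cells = dict(cells)
        self.reads = 0
        self.writes = 0

    def read(self, name):
        self.reads += 1
        return self.cells[name]

    def write(self, **updates):
        # Several cells in one call count as one reference (a full-word store).
        self.writes += 1
        self.cells.update(updates)


def lamport_first(m, i):
    m.write(X=i)
    assert m.read("Y") == FREE
    m.write(Y=i)
    assert m.read("X") == i      # no contention: skip the delay
    m.write(Y=FREE)              # release after the critical section


def lamport_second(m, i):
    m.write(Bi=True)
    m.write(X=i)
    assert m.read("Y") == FREE
    m.write(Y=i)
    assert m.read("X") == i
    m.write(Y=FREE)
    m.write(Bi=False)


def alur_taubenfeld(m, i):
    m.write(X=i)
    assert m.read("Y") == FREE
    m.write(Y=i)
    assert m.read("X") == i
    m.write(Z=1)
    m.write(Z=0)                 # release: clear Z first
    if m.read("Y") == i:         # then clear Y only if it still equals i
        m.write(Y=FREE)


def new_algorithm(m, i):
    m.write(X=i)
    assert m.read("Y") == FREE
    m.write(Y=i)
    assert m.read("X") == i
    m.write(F=IN)
    m.write(Y=FREE, F=OUT)       # one full-word write releases both halves


def fast_path_counts(i=1):
    """Return {algorithm name: (reads, writes)} for one uncontended pass."""
    counts = {}
    for algorithm in (lamport_first, lamport_second, alur_taubenfeld, new_algorithm):
        m = CountingMemory(X=0, Y=FREE, Z=0, F=OUT, Bi=False)
        algorithm(m, i)
        counts[algorithm.__name__] = (m.reads, m.writes)
    return counts
```

Under these assumptions the tallies reproduce the no-contention column of table 1, and the 25% figure is the drop from 8 references (3 reads, 5 writes) to 6 (2 reads, 4 writes).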

¹ As noted in section 1, we assume that a reference to shared memory may take time linear in the number of processes
attempting to access the same location concurrently.




     1: START:
     2:   X ← i
     3:   if Y ≠ free
     4:     goto START
     5:   Y ← i
     6:   if X ≠ i
     7:     { delay }
     8:     if (Y, F) ≠ (i, out)
     9:       goto START
    10:   F ← in
    11:   { critical section }
    12:   (Y, F) ← (free, out)
    13:   { non-critical section }
    14:   goto START

Figure 2: A new fast mutual exclusion algorithm.
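
As an informal check of figure 2, the algorithm can be exercised in a small discrete-step simulation. The Python sketch below is not the C code used in our experiments. It models each process as a generator that performs at most one shared-memory reference per scheduler step, and it approximates the bound on relative execution rates by letting every process take exactly one step per round, in random order. The scheduler, the parameter names (delay_steps, cs_steps, rounds, seed), and the integer encodings are all assumptions of the sketch.

```python
import random

FREE, OUT, IN = 0, 0, 1   # process ids start at 1, so 0 can stand for "free"


def process(i, mem, stats, delay_steps, cs_steps):
    """Process i of figure 2; every resume does at most one shared reference."""
    while True:
        mem["X"] = i                              # line 2
        yield
        y = mem["Y"]                              # line 3
        yield
        if y != FREE:
            continue                              # line 4: goto START
        mem["Y"] = i                              # line 5
        yield
        x = mem["X"]                              # line 6
        yield
        if x != i:
            for _ in range(delay_steps):          # line 7: delay
                yield
            y, f = mem["Y"], mem["F"]             # line 8: one full-word read
            yield
            if (y, f) != (i, OUT):
                continue                          # line 9: goto START
        mem["F"] = IN                             # line 10
        yield
        stats["inside"] += 1                      # line 11: critical section
        stats["entries"] += 1
        stats["max_inside"] = max(stats["max_inside"], stats["inside"])
        for _ in range(cs_steps):
            yield
        stats["inside"] -= 1
        mem["Y"], mem["F"] = FREE, OUT            # line 12: one full-word write
        yield


def simulate(n=4, rounds=2000, delay_steps=8, cs_steps=3, seed=1):
    """Run n processes for the given number of rounds and return the statistics."""
    rng = random.Random(seed)
    mem = {"X": 0, "Y": FREE, "F": OUT}
    stats = {"inside": 0, "max_inside": 0, "entries": 0}
    procs = [process(i, mem, stats, delay_steps, cs_steps) for i in range(1, n + 1)]
    for _ in range(rounds):
        rng.shuffle(procs)       # one step per process per round, random order
        for p in procs:
            next(p)
    return stats
```

With a delay of several rounds, every process that has read Y = free has time to finish lines 5, 6, and 10 before a delayed process reaches line 8, so stats["max_inside"] should never exceed 1. Shrinking delay_steps to 0 removes that guarantee and is a convenient way to see why the delay is needed.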


     1: START:
     2:   X ← i
     3:   repeat until Y = free
     4:   Y ← i
     5:   if X ≠ i
     6:     { delay }
     7:     if Y ≠ i
     8:       goto START
     9:     repeat until Z = 0
    10:   else
    11:     Z ← 1
    12:   { critical section }
    13:   Z ← 0
    14:   if Y = i
    15:     Y ← free
    16:   { non-critical section }
    17:   goto START

Figure 3: Alur and Taubenfeld's algorithm.
3  Correctness


In  this  section  we  present  proofs  of  mutual  exclusion  and  livelock  freedom  for  our  algorithm. 

As it concerns the algorithm, each process i can be conceptualized as a sequence of non-looping
subprocesses. Thus, the execution time of a subprocess is bounded except for the critical section. A subprocess
either acquires the lock, executes the critical section, releases the lock, and terminates; or fails to acquire
the lock at some point, terminates, and the next subprocess begins execution from START. It is clear that
at any moment each process has at most one subprocess running.

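Let U be the set of (non-looping) subprocesses running at time t. U can be partitioned into five disjoint
sets A, B, C, D, and E, defined in terms of the truth of the four conditions in lines 3, 6, and 8 in the
algorithm, where the condition in line 8 can be considered as two sequential conditions, the first testing Y
and, if it is equal to i, the second testing F. Each condition is evaluated at the moment the subprocess
executes the corresponding line. The sets are defined as follows:

    A = { i | Y = free, X = i }
    B = { i | Y = free, X ≠ i, Y = i, F = out }
    C = { i | Y = free, X ≠ i, Y ≠ i }
    D = { i | Y = free, X ≠ i, Y = i, F ≠ out }
    E = { i | Y ≠ free }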
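
The partition can be read as a decision procedure over the outcomes of the tests in lines 3, 6, and 8. The following Python sketch is one way to write it down, with the set names taken from the way the proofs below use them; the function name classify and its boolean parameters are assumptions introduced only for this illustration.

```python
def classify(y_free_at_3, x_is_i_at_6=None, y_is_i_at_8=None, f_out_at_8=None):
    """Return the set (A to E) of a subprocess from its test outcomes.

    Later arguments are ignored when an earlier test already decides the set,
    mirroring the order in which figure 2 evaluates the conditions.
    """
    if not y_free_at_3:
        return "E"        # fails at line 3 and never sets Y
    if x_is_i_at_6:
        return "A"        # fast path: enters the critical section without delay
    if not y_is_i_at_8:
        return "C"        # delayed, then finds that Y was overwritten or freed
    if f_out_at_8:
        return "B"        # delayed, then enters the critical section
    return "D"            # delayed, Y still equals i, but F shows an occupant
```

Only subprocesses classified as A or B reach lines 10 through 12.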


In the proof we use the following notation: ∀i, j ∈ U, il denotes the time at which subprocess i executes
line l in the algorithm, and il < jm denotes that i executes line l before j executes line m.


3.1  Mutual  Exclusion 

Let W be the set of subprocesses executing their critical sections at time t: W = { i | i ∈ A ∪ B and
i10 < t < i12 }. To prove mutual exclusion it suffices to prove that ∀t, |W| ≤ 1.


Lemma 1: ∀i ∈ A ∪ B, ∄j ∈ A ∪ B such that j5 < i12 < j12.


By defining the “order” of a subprocess to be the number of subprocesses that set Y to free before it does,
Lemma 1 can be proved by induction on the order of subprocesses in A ∪ B. A complete proof is presented
in the appendix.

Now we can define supersets for A and B by transforming the conditions on the values of state variables
to conditions on the order of setting and reading them by the subprocess under consideration and other
concurrent subprocesses.

∀i ∈ A, and ∀j ∈ U − {i}, for Y to be equal to free at i3, either i3 < j5 or j12 < i3 (Lemma 1). And
for X to be equal to i at i6, either j2 < i2 or i6 < j2.

Therefore, A ⊆ { i | i ∈ U and ∀j ∈ U − {i},
                     (j2 < i2 and i3 < j5) or
                     (i6 < j2) or
                     (j2 < i2 and j12 < i3) }

∀i ∈ B and ∀j ∈ U − {i}, for Y to be equal to free at i3, either i3 < j5 or j12 < i3 (Lemma 1). For Y
to be equal to i at i8, either i8 < j5, j5 < i5 and i8 < j12, or j12 < i5. And for F to be equal to out at i8,
either i8 < j10 or j12 < i8.

Therefore, B ⊆ { i | i ∈ U and ∀j ∈ U − {i},
                     (i8 < j5) or
                     (i3 < j5 < i5 and i8 < j10) or
                     (i3 < j5 and j12 < i5) or
                     (j12 < i3) }


Let AA = { (i, j) | i, j ∈ A and i ≠ j }, then

    AA ⊆ { (i, j) | i, j ∈ U such that
                    (j6 < i2 and j12 < i3) or
                    (i6 < j2 and i12 < j3) }

Let BB = { (i, j) | i, j ∈ B and i ≠ j }, then

    BB ⊆ { (i, j) | i, j ∈ U such that
                    (j3 < i5 and i8 < j5 and j8 < i10) or
                    (j3 < i5 and i12 < j5) or
                    (i12 < j3) or
                    (i3 < j5 and j8 < i5 and i8 < j10) or
                    (i3 < j5 and j12 < i5) or
                    (j12 < i3) }

Let AB = { (i, j) | i ∈ A and j ∈ B }, then

    AB ⊆ { (i, j) | i, j ∈ U such that
                    (j2 < i2 and i3 < j5 and j8 < i5) or
                    (j2 < i2 and j3 < i5 < j5 and j8 < i10) or
                    (j2 < i2 and j3 < i5 and i12 < j5) or
                    (i6 < j2 and i12 < j3) or
                    (j2 < i2 and j12 < i3) }

Let BA = { (i, j) | i ∈ B and j ∈ A }, then BA = { (i, j) | (j, i) ∈ AB }.

With sufficient delay,

    BB ⊆ { (i, j) | i, j ∈ U such that
                    (j3 < i5 and i12 < j5) or
                    (i12 < j3) or
                    (i3 < j5 and j12 < i5) or
                    (j12 < i3) }

and

    AB ⊆ { (i, j) | i, j ∈ U such that
                    (j2 < i2 and j3 < i5 and i12 < j5) or
                    (i6 < j2 and i12 < j3) or
                    (j2 < i2 and j12 < i3) }

Let W₀ = { (i, j) | i, j ∈ W and i ≠ j }, then W₀ ⊆ AA ∪ BB ∪ AB ∪ BA. Every disjunct that remains in
AA, BB, AB, and BA requires one subprocess to execute line 12 before the other executes line 3 or line 5,
so the intervals (i10, i12) and (j10, j12) cannot overlap. Then, W₀ = ∅. Then, |W| ≤ 1.

This completes the proof of mutual exclusion □


3.2  Livelock  Freedom 

Lemma 2: ∀i ∈ U, if i sets Y to i and terminates at time t while Y = i, then ∃j ∈ A ∪ B such
that j10 < t < j12.

Proof:

If i ∈ A ∪ B then i sets Y to free and terminates. If i ∈ C then i terminates while Y ≠ i. If i ∈ E
then i never sets Y to i. If i ∈ D then at i8, F = in, so it must be the case that ∃j ∈ A ∪ B such that
j10 < i8 < j12. If j12 < t then i terminates while Y ≠ i; otherwise i terminates after j10 and before j12 □

Lemma 3: ∀i ∈ C, A ∪ B ∪ D ≠ ∅ for some t, i5 < t < i8.
Proof:

Assume that the lemma is false, i.e. ∃i ∈ C, ∀t, i5 < t < i8, A ∪ B ∪ D = ∅. Since i ∈ C then at i8, Y ≠ i.
Hence either Y = free or Y = j ≠ i, where j ∈ U. If Y = free then ∃k ∈ A ∪ B such that i5 < k12 < i8,
which contradicts the initial assumption. Thus Y = j, i.e. ∃j ∈ A ∪ B ∪ C ∪ D such that j is the last process
to set Y between i5 and i8. Then according to the initial assumption, j ∈ C. Therefore ∃k ∈ A ∪ B ∪ C ∪ D
that sets Y between j5 and j8. This again implies that k ∈ C. Since j is the last subprocess to set Y
before i8, i8 < k5. If i8 < k3 then ∃l ∈ A ∪ B such that i8 < l12 < k3, i.e. j5 < l12 < j8. Contradiction.
Therefore, k3 < i8. If i5 < k3 then ∃l ∈ A ∪ B such that i5 < l12 < k3, i.e. i5 < l12 < i8. Contradiction.
Therefore k3 < i5. Thus k3 < i5 and i8 < k5, which is not possible with sufficient delay. Therefore, the
initial assumption is false and the lemma is true □

Define t1 and t2 such that ∀i ∈ U, i starts after t1 and terminates before t2. Assuming that the critical
sections are finite, (t1, t2) can be chosen to be finite. Let U' be the set of subprocesses running in the interval
(t1, t2), and similarly define A', B', C', D', and E'. It is clear that U ⊆ U', A ⊆ A', B ⊆ B', C ⊆ C',
D ⊆ D', and E ⊆ E'.

The algorithm is livelock free if U ≠ ∅ implies that A' ∪ B' ≠ ∅. Assuming that U ≠ ∅, ∃i ∈ A ∪ B ∪ C ∪
D ∪ E.

If i ∈ E then at i3, Y ≠ free. Hence ∃j, Y = j and j ∈ A' ∪ B' ∪ C' ∪ D', which is the last to set Y
before i3. j is either running at i3 or has already terminated. If j is running then A ∪ B ∪ C ∪ D ≠ ∅. If j
has already terminated then according to Lemma 2, ∃k ∈ A ∪ B that was executing its critical section while
j terminated. It cannot be the case that k12 < i3 because j is the last to set Y before i3. Therefore it must
be the case that k10 < i3 < k12, so A ∪ B ≠ ∅. Therefore E ≠ ∅ implies that A' ∪ B' ∪ C' ∪ D' ≠ ∅. If
i ∈ C then according to Lemma 3, A' ∪ B' ∪ D' ≠ ∅. Finally, if i ∈ D then at i8, F ≠ out. Hence ∃j ∈ A ∪ B
such that j10 < i8 < j12. Therefore, A' ∪ B' ≠ ∅.

Therefore, U ≠ ∅ implies that A' ∪ B' ≠ ∅. This completes the proof of livelock freedom □

4  Experiments 

In  this  section  we  present  the  experimental  results  of  implementing  three  mutual  exclusion  algorithms — 
Lamport’s  second,  Alur  and  Taubenfeld’s,  and  ours — on  an  8-processor  Silicon  Graphics  (SGI)  Iris  4D/480 
multiprocessor  and  on  a  larger  simulated  machine.  Based  on  relative  numbers  of  shared-memory  reads  and 
writes  (see  table  1),  we  expected  these  algorithms  to  dominate  the  others.  Among  them,  we  expected  the 
new  algorithm  to  perform  the  best,  both  with  and  without  contention.  We  also  expected  exponential  backoff 
to  substantially  improve  the  performance  of  all  three  algorithms. 

It  was  not  clear  to  us  a  priori  whether  Alur  and  Taubenfeld’s  algorithm  would  perform  better  or  worse 
than  Lamport’s  algorithm.  The  former  performs  more  shared  memory  references  on  its  fast  code  path,  but 
has  a  lower  asymptotic  complexity  on  its  slow  code  path.  How  often  each  path  would  execute  seemed  likely 
to  depend  on  the  effectiveness  of  backoff.  For  similar  reasons,  it  was  unclear  how  large  the  performance 
differences  among  the  algorithms  would  be.  Our  experiments  therefore  serve  to  verify  expected  relative 
orderings,  determine  unknown  orderings,  and  quantify  differences  in  performance. 

4.1  Real  Performance  on  a  Small  Machine 

To obtain a bound on relative rates of processor execution, we exploited the real-time features of SGI's
IRIX operating system, dedicating one processor to system activity, and running our test on the remaining
seven processors, with interrupts disabled. The system processor itself was lightly loaded, leaving the bus
essentially free. We disabled caching for the shared variables used by the lock algorithms, but enabled it
for private variables and code. We compiled all three locks with the MIPS compiler's highest (-O3) level of
optimization.

We tested two versions of each algorithm: one with limited exponential backoff and one without. The
no-backoff version of Lamport's algorithm matches the pseudo-code on the right side of figure 1. The no-backoff
version of the new algorithm matches the pseudo-code in figure 2, except that after discovering that Y ≠ free
in line 3, or that (Y, F) ≠ (i, out) in line 8, we wait for Y = free before returning to START. Similarly, the
no-backoff version of Alur and Taubenfeld's lock matches the pseudo-code in figure 3, except that (1) if Y ≠
free in line 3, we return to START after Y becomes free, rather than continuing, and (2) if Y ≠ i at line 7,
we wait for Y = free before returning to START. In the backoff versions of all three algorithms, each repeat
loop includes a delay that increases geometrically in consecutive iterations, subject to a cap. The base,
multiplier, and cap were chosen by trial and error to maximize performance. C code for our experiments
can be obtained via anonymous ftp from cayuga.cs.rochester.edu (directory pub/scalable_sync/fast).
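
The shape of such a backoff loop can be sketched as follows. This is an illustration in Python rather than the C code used in the experiments; the function names, the pause callback, and the default constants are assumptions, and the constants actually used were tuned by trial and error as described above.

```python
def backoff_delays(base, multiplier, cap):
    """Yield delays that grow geometrically from base and never exceed cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * multiplier, cap)


def wait_until(condition, pause, base=1, multiplier=2, cap=64):
    """Spin until condition() holds, pausing longer after each failed test.

    pause(d) is a caller-supplied function that idles for d time units.
    Returns the total delay that was requested.
    """
    waited = 0
    for delay in backoff_delays(base, multiplier, cap):
        if condition():
            return waited
        pause(delay)
        waited += delay
```

A repeat loop such as "repeat until Y = free" would correspond to wait_until with a condition that reads Y, so that each unsuccessful read is followed by a progressively longer pause during which the process makes no shared-memory references.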

Performance results appear in figures 4 and 5. In both graphs, point (x, y) indicates the number of
microseconds required for one processor to acquire and release the lock, when x processors are attempting
to do so simultaneously. These numbers are derived from program runs in which each processor executes
100,000 critical sections. Within the critical section, each processor increments a shared variable. After
releasing the lock, the processor executes only loop overhead before attempting to acquire the lock again.
Program runs were repeated several times; reported results are stable to about ±2 in the third significant
digit. The one-processor points indicate the time to acquire and release the lock (plus loop overhead) in the
absence of contention. Points for two or more processors indicate the time for one processor to pass the lock
on to the next.


Figure 4: Performance results without backoff on the SGI Iris. (Horizontal axis: processors.)

Figure 5: Performance results with backoff on the SGI Iris.

code path of the delay-based algorithms.


Backoff is clearly important. Without it, performance degrades rapidly with increasing contention.
Lamport's algorithm degrades smoothly, while the new algorithm and that of Alur and Taubenfeld behave
erratically (see below). By contrast, with backoff, performance of all three algorithms is excellent, and
roughly proportional to the number of shared memory references on the fast code path. With only six such
references, the new algorithm is the fastest.

We instrumented the two delay-based algorithms in an attempt to explain the strange (but highly
repeatable) behavior of the new algorithm and that of Alur and Taubenfeld in the no-backoff experiments.
The results appear in table 2. We hypothesize that with odd numbers of processors there is usually one
that is able to enter its critical section without executing a delay, while with even numbers of processors the
test falls into a mode in which all processors are frequently delayed simultaneously, with none in the critical
section. This hypothesis is consistent with memory reference traces recorded for similarly anomalous points
in the simulation experiments, as discussed in the following section.

Surprisingly, all three algorithms with backoff outperform the native test-and-set locks supported in
hardware on the SGI machine. These native locks employ a separate synchronization bus, and are generally
considered very fast. With backoff, processes execute the fast path in almost every lock acquisition. This
explains the observation that relative performance of the three locks is proportional to the number of shared
memory references in the fast paths in their acquisition/release protocols.

4.2  Simulated  Performance  on  a  Large  Machine 

To investigate the effect of backoff on fast mutual exclusion algorithms with only atomic read and write, and
to evaluate the relative performance of the three algorithms when there is a higher level of contention on
a large number of processors, we simulated the execution of these three lock algorithms on a hypothetical
large machine with 128 processors.

Our simulations use the same executable program employed on the SGI machine. We run this program
under Veenstra's MIPS interpreter, Mint [9], with a simple back end that determines the latency of each
reference to shared memory. We assume that shared memory is uncached, that each memory request spends
36 cycles in each direction traversing some sort of processor/memory interconnect, that competing requests
queue up at the memory, and that the memory can retire one request every 10 cycles. The minimum time
for a shared-memory reference is therefore 82 cycles. For the delay-based algorithms, we used a delay of
2500 cycles, which provides enough time for the memory to service 2 requests from each of 128 processors.
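
The latency model can be written down in a few lines. The Python sketch below is an illustration rather than the simulator's actual back end. It assumes a single memory module with a first-come, first-served queue and a one-way transit time of 36 cycles, which is the value implied by the 82-cycle minimum (82 = 2 × 36 + 10); the function and parameter names are likewise assumptions.

```python
def reference_latencies(issue_times, transit=36, service=10):
    """Latency in cycles of each shared-memory reference to one memory module.

    issue_times are the cycles at which processors issue their requests.
    Requests are served in order of arrival at the memory, one every
    `service` cycles, and spend `transit` cycles on the interconnect in
    each direction.
    """
    latencies = []
    memory_free_at = 0
    for issue in sorted(issue_times):
        start = max(issue + transit, memory_free_at)   # wait behind earlier requests
        memory_free_at = start + service
        latencies.append(memory_free_at + transit - issue)
    return latencies
```

An isolated request sees the 82-cycle minimum, while each additional request arriving at the same moment waits a further 10 cycles behind its predecessors. This is the serialization that makes heavily contended references expensive.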

Figures 6 and 7 show that the performance of all three algorithms (and Lamport's in particular) improves
substantially with the use of exponential backoff. Thus backoff makes mutual exclusion feasible even for large
numbers of processors, with no atomic instructions other than read and write.

For figure 7, backoff constants (base, multiplier, and cap) were selected for each algorithm to maximize its
performance on 128 processors. On smaller numbers of processors this backoff is too high, and performance
is unstable. With greater than 32 processors, the relative order of the algorithms remains the same over a
wide range of possible backoff constants. Most of the individual data points reflect simulation runs in which
each processor executes 100 critical sections. We ran longer simulations on a subset of the points in order
to verify that the total number of elapsed cycles was linearly proportional to the number of critical section
executions.

All three algorithms were found to be sensitive not only to the choice of backoff constants, but also
to critical and non-critical section lengths. With many variations of these parameters, the overall relative
performance of the three algorithms was always found to be the same. The presented results are with a single
shared-memory update in each critical section, and nothing but loop overhead in the non-critical sections.

Unstable parts of the graphs in figures 6 and 7 were investigated using detailed traces of shared memory
references. The apparently anomalous points can be attributed to the big difference in execution time
between the fast and the slow paths of the algorithms. With many variations of the backoff constants and
length of critical and non-critical sections, there are always points (numbers of processors) where most of
the time a processor executes the slow path to acquire and release the lock. But these points were found to
change with different combinations of parameters.

The simulation results verify that the new algorithm outperforms the others with its low number of shared
memory references in the fast path. For large numbers of processors, Alur and Taubenfeld's algorithm always
outperforms Lamport's algorithm despite its higher number of shared memory references in the fast path,
due to the increasing cost of the slow path of Lamport's algorithm.


5  Conclusions 

Fast  mutual  exclusion  with  only  reads  and  writes  is  a  topic  of  considerable  theoretical  interest,  and  of  some 
practical  interest  as  well.  We  have  presented  a  new  fast  mutual  exclusion  algorithm  that  has  an  asymptotic 
time  complexity  of  0(n)  in  the  presence  of  contention,  while  requiring  only  2  reads  and  4  writes  in  the  absence 
of  contention.  The  algorithm  capitalizes  on  the  ability  of  most  memory  systems  to  read  and  write  atomically 
at both full- and half-word granularities. The same asymptotic result has been obtained independently
by Alur and Taubenfeld, without the need for multi-grain memory operations, but with a higher constant
overhead: 3 reads and 5 writes on the fast code path.

From  a  practical  point  of  view,  our  results  confirm  that  mutual  exclusion  with  only  reads  and  writes  is 
a  viable,  if  not  ideal,  means  of  synchronization.  Its  most  obvious  potential  problem — contention — can  be 
mitigated  to  a  large  extent  by  the  use  of  exponential  backoff. 
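
As an illustration of the backoff idea, the following Python sketch retries a lock-acquisition attempt and pauses for a randomized time whose upper bound doubles after every failure. It is a minimal sketch under stated assumptions, not the code used in the experiments: the names `try_acquire`, `base`, and `cap`, the use of randomization, and the presence of a cap are all illustrative choices that the text does not specify.

```python
import random
import time


def backoff_bounds(base, cap, attempts):
    """Upper bounds on the pause before each retry: base, 2*base, ..., capped at cap."""
    bounds = []
    bound = base
    for _ in range(attempts):
        bounds.append(bound)
        bound = min(2 * bound, cap)
    return bounds


def acquire_with_backoff(try_acquire, base=1e-6, cap=1e-3,
                         sleep=time.sleep, rng=random.random):
    """Call try_acquire() until it returns True.

    After each failure, pause for a random fraction of the current bound,
    then double the bound (up to cap). Returns the number of failed attempts.
    try_acquire is a hypothetical stand-in for one attempt at the lock's
    entry protocol; base and cap are assumed tuning constants.
    """
    bound = base
    failures = 0
    while not try_acquire():
        failures += 1
        sleep(rng() * bound)
        bound = min(2 * bound, cap)
    return failures
```

The `sleep` and `rng` parameters exist only so that the schedule can be exercised deterministically; a real lock would more likely spin in a local delay loop than call an operating-system sleep.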

Most modern microprocessors intended for use in multiprocessors provide atomic instructions designed
for synchronization (test_and_set, swap, compare_and_swap, fetch_and_add, load_linked/store_conditional,
etc.). For those that do not, system designers are left with the choice between implementing hardware
synchronization outside the processor (as in the synchronization bus of Silicon Graphics machines), or
employing an algorithm of the sort discussed in this paper. Backoff makes the latter option attractive.

On the SGI Iris, our new algorithm outperforms the native hardware locks by more than 30%. For
arbitrary user-level programs, which cannot assume predictable execution rates, Lamport's second algorithm
(with backoff) outperforms the native locks by 25%. These results are reminiscent of recent studies by Yang
and Anderson, who found that their hierarchical read- and write-based mutual exclusion algorithm (line
•1 in table 1) provided performance competitive with that of fetch_and_Φ-based algorithms on the BBN
TC2000 [10]. Both Lamport's second algorithm and Yang and Anderson's algorithms require space per lock
linear in the number of contending processes. For systems with very large numbers of processes, Merritt and
Taubenfeld have proposed a technique that allows a process to register, on the fly, as a contender for only
the locks that it will actually be using [6].

For the designers of microprocessors and multiprocessors, we remain convinced that the most cost-effective
synchronization mechanisms are algorithms that use simple fetch_and_Φ instructions to establish
links  between  processes  that  then  spin  on  local  locations  [5].  For  machines  without  appropriate  instructions, 
however,  fast  mutual  exclusion  remains  a  viable  option. 


References 

[1] R. Alur and G. Taubenfeld. Results about Fast Mutual Exclusion. Technical report, AT&T Bell
Laboratories, 5 January 1993. Revised version of a paper presented at the Thirteenth IEEE Real-Time
Systems Symposium, December 1992.

[2] T. E. Anderson. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors.
IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, January 1990.

[3] E. W. Dijkstra. Co-operating sequential processes. In F. Genuys, editor, Programming Languages,
pages 43-112. Academic Press, London, 1968.


[4] L. Lamport. A Fast Mutual Exclusion Algorithm. ACM Transactions on Computer Systems,
February 1987.

[5] J. M. Mellor-Crummey and M. L. Scott. Algorithms for Scalable Synchronization on Shared-Memory
Multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.

[6]  M.  Merritt  and  G.  Taubenfeld.  Speeding  Lamport’s  Fast  Mutual  Exclusion  Algorithm.  Information 
Processing  Letters,  45(3):137-142,  March  1993. 

[7]  G.  L.  Peterson.  Myths  About  the  Mutual  Exclusion  Problem.  Information  Processing  Letters, 
12(3):115-116,  June  1981. 

[8]  E.  Styer.  Improving  Fast  Mutual  Exclusion.  In  Proceedings  of  the  Eleventh  ACM  Symposium  on 
Principles  of  Distributed  Computing,  pages  159-168,  Vancouver,  BC,  Canada,  9-12  August  1992. 

[9] J. E. Veenstra. Mint Tutorial and User Manual. TR 452, Computer Science Department, University
of Rochester, May 1993.

[10]  H.  Yang  and  J.  H.  Anderson.  Fast,  Scalable  Synchronization  with  Minimal  Hardware  Support  (extended 
abstract).  In  Proceedings  of  the  Twelfth  ACM  Symposium  on  Principles  of  Distributed  Computing,  (to 
appear)  15-18  August  1993. 


A  Proof  of  Lemma  1 


Lemma 1: ∀i ∈ A ∪ B, ∄j ∈ A ∪ B such that j5 < i12 < j12.

Proof: 

Let the order of a subprocess in A ∪ B be the number of subprocesses that set Y to free before it does.
A  proof  by  induction  on  the  order  of  subprocesses  involves  proving  that:  (1)  The  lemma  is  true  for  the 
subprocess  of  order  0,  and  (2)  If  the  lemma  is  true  for  subprocesses  of  order  less  than  n,  then  it  is  true  for 
the  subprocess  of  order  n. 
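
To make the statement and the induction measure concrete, the sketch below restates both over a recorded trace. It rests on assumptions that go beyond the text here: that i5 and i12 denote the times at which subprocess i executes statements 5 and 12, and that statement 12 is the one that sets Y to free (as the phrase "sets Y to free before j3 i.e. before i12" in the proof suggests). The trace format and the function names are hypothetical.

```python
def order(times, i):
    """Number of subprocesses that execute statement 12 before subprocess i does.

    Assumes statement 12 is the one that sets Y to free.
    times maps a subprocess name to {statement number: timestamp}.
    """
    return sum(1 for k in times if times[k][12] < times[i][12])


def lemma_violations(times):
    """Return every pair (i, j) with j5 < i12 < j12.

    Lemma 1 asserts that this list is empty for any execution of the algorithm.
    """
    return [(i, j)
            for i in times
            for j in times
            if times[j][5] < times[i][12] < times[j][12]]


# Hypothetical traces: one where the [5, 12] intervals do not overlap,
# and one where p executes statement 12 inside q's interval.
serial = {"p": {5: 1, 12: 4}, "q": {5: 6, 12: 9}}
overlapping = {"p": {5: 1, 12: 7}, "q": {5: 3, 12: 9}}
```

On the `serial` trace `lemma_violations` finds nothing and `order` ranks p before q; on the `overlapping` trace it reports the pair (p, q), which is exactly the situation the proof shows cannot arise.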

Basis: 

• Let i be the subprocess of order 0, and assume that ∃j ∈ A ∪ B such that j5 < i12 < j12.

• Assume that i5 < j3. Since j5 < i12 then j3 < i12. Then i5 < j3 < i12. Then for Y to be equal to
free at j3 it must be the case that ∃k ∈ A ∪ B that sets Y to free before j3, i.e. before i12. Then i
is of order greater than 0. Contradiction. Therefore, it must be the case that j3 < i5.

• Assume that j5 < i3. Since i12 < j12 then i3 < j12. Then j5 < i3 < j12. Then for Y to be equal to
free at i3 it must be the case that ∃k ∈ A ∪ B that sets Y to free before i3, i.e. before i12. Then i is
of order greater than 0. Contradiction. Therefore, it must be the case that i3 < j5.

• Considering the four possible cases for i and j belonging to A or B:

1. i and j ∈ A: Since i3 < j5 then i2 < j6. Then for X to be equal to j at j6 it must be the case
that i2 < j2. Since j3 < i5 then j2 < i6. Then i2 < j2 < i6, so at i6 X ≠ i. Then i ∉ A.
Contradiction.

2. i ∈ A and j ∈ B:

- Assume that j5 < i5. For Y to be equal to j at j8 it must be the case that j8 < i5. Then
i3 < j5 and j8 < i5, which is not possible with sufficient delay. Therefore, it must be the
case that i5 < j5.

- Since i5 < j5 then with sufficient delay it must be the case that i10 < j8. For F to be equal
to out at j8 it must be the case that ∃k ∈ A ∪ B such that i10 < k12 < j8 (it is possible
that k = i). If k = i then for Y to be equal to j at j8, i12 < j5, which contradicts the initial
assumption. If k ≠ i then k12 < j5 < i12, then i is not of order 0. Contradiction.

3. i ∈ B and j ∈ A:

- Assume that i5 < j5. For Y to be equal to i at i8 it must be the case that i8 < j5. Then
j3 < i5 and i8 < j5, which is not possible with sufficient delay. Therefore, it must be the
case that j5 < i5.

- Since j5 < i5 then with sufficient delay it must be the case that j10 < i8. For F to be equal
to out at i8 it must be the case that ∃k ∈ A ∪ B such that j10 < k12 < i8 (it is possible that
k = j). Then i is not of order 0. Contradiction.

4. i and j ∈ B: For Y to be equal to i and j at i8 and j8 respectively, it must be either the case
that i8 < j5 or j8 < i5. Then it must be either the case that i3 < j5 and j8 < i5; or j3 < i5 and
i8 < j5. Both cases are not possible with sufficient delay.

• Therefore, ∄j ∈ A ∪ B such that j5 < i12 < j12, i.e. the lemma is true for the subprocess of order 0.

Induction:

• Assume that the lemma is true for all subprocesses of order less than n. Let i ∈ A ∪ B be the subprocess
of order n.

• Assume that ∃j ∈ A ∪ B such that j5 < i12 < j12.

• Assume that i5 < j3. Since j5 < i12 then j3 < i12. Then i5 < j3 < i12. Then for Y to be equal to
free at j3 it must be the case that ∃k ∈ A ∪ B such that i5 < k12 < j3, i.e. i5 < k12 < i12.
This contradicts the inductive hypothesis. Therefore, it must be the case that j3 < i5.

• Assume that j5 < i3. Since i12 < j12 then i3 < j12. Then j5 < i3 < j12. Then for Y to be equal
to free at i3 it must be the case that ∃k ∈ A ∪ B such that j5 < k12 < i3, i.e. j5 < k12 < j12 and
k12 < i12. This contradicts the inductive hypothesis. Therefore, it must be the case that i3 < j5.

• Considering the four possible cases for i and j belonging to A or B:

1. i and j ∈ A: Since i3 < j5 then i2 < j6. Then for X to be equal to j at j6 it must be the case
that i2 < j2. Since j3 < i5 then j2 < i6. Then i2 < j2 < i6, so at i6 X ≠ i. Then i ∉ A.
Contradiction.

2. i ∈ A and j ∈ B:

- Assume that j5 < i5. For Y to be equal to j at j8 it must be the case that j8 < i5. Then
i3 < j5 and j8 < i5, which is not possible with sufficient delay. Therefore, it must be the
case that i5 < j5.

- Since i5 < j5 then with sufficient delay it must be the case that i10 < j8. Then j3 < i10 < j8.
For F to be equal to out at j8 it must be the case that ∃k ∈ A ∪ B such that i10 < k12 < j8
(it is possible that k = i). If k = i then for Y to be equal to j at j8, i12 < j5, which contradicts
the initial assumption. If k ≠ i then k12 < j5 < i12, then i10 < k12 < i12, which contradicts
the inductive hypothesis.

3. i ∈ B and j ∈ A:

- Assume that i5 < j5. For Y to be equal to i at i8 it must be the case that i8 < j5. Then
j3 < i5 and i8 < j5, which is not possible with sufficient delay. Therefore, it must be the
case that j5 < i5.

- Since j5 < i5 then with sufficient delay it must be the case that j10 < i8. Then i3 < j10 < i8.
For F to be equal to out at i8 it must be the case that ∃k ∈ A ∪ B such that j10 < k12 < i8
(it is possible that k = j). If k = j then for Y to be equal to i at i8, j12 < i5, which
contradicts the initial assumption. If k ≠ j then k12 < i5 < j12, then j10 < k12 < j12,
which contradicts the inductive hypothesis.

4. i and j ∈ B: For Y to be equal to i and j at i8 and j8 respectively, it must be either the case
that i8 < j5 or j8 < i5. Then it must be either the case that i3 < j5 and j8 < i5; or j3 < i5 and
i8 < j5. Both cases are not possible with sufficient delay.

• Therefore, ∄j ∈ A ∪ B such that j5 < i12 < j12. □